Use MMI not CTC model for alignment #203
base: master
Conversation
Below are some notes I made about results. There is a modest improvement of around 0.3% absolute on test-other from using the MMI rather than the CTC model for alignment.
x = nnet_output.abs().sum().item()
# x - x != 0 is only true when x is NaN or +/-inf, so this reverts to the
# original output whenever the alignment model's forward pass blew up.
if x - x != 0:
    print("Warning: reverting nnet output since it seems to be nan.")
    nnet_output = nnet_output_orig
@GNroy perhaps this is related to the error you had? I found that I'd sometimes get NaNs in the forward pass of the alignment model. I commented out ali_model.eval() as well as making this change, because I suspected it had to do with test-mode batchnorm, but I might have been wrong; I need to test this. It might also relate to float16 usage (or a combination of the two).
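For reference, here is a minimal sketch of a more explicit version of the same guard (the x - x != 0 trick catches both NaN and inf, since inf - inf is NaN); the helper name is illustrative and not part of this PR:

import torch

def safe_ali_output(nnet_output: torch.Tensor,
                    nnet_output_orig: torch.Tensor) -> torch.Tensor:
    # torch.isfinite is False for NaN and +/-inf, covering the same
    # failure modes as the x - x != 0 check above.
    if not torch.isfinite(nnet_output).all():
        print("Warning: reverting nnet output since it contains NaN/inf.")
        return nnet_output_orig
    return nnet_output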
Thanks!
Actually, I resolved my issue.
NaNs were produced by the encoder part (not a loss or softmax problem as I thought before).
It was fixed with some hyperparameter re-tuning. In particular, setting eps=1e-3 for the optimizer helped.
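For concreteness, a minimal sketch of that change (the Adam choice and the dummy module are assumptions; the relevant part is raising eps from its 1e-8 default to 1e-3):

import torch

# Dummy module standing in for the encoder; the optimizer class is an
# assumption. Raising eps bounds the denominator of the Adam update, so
# steps cannot explode when the second-moment estimate is near zero.
model = torch.nn.Linear(80, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-3)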
Results after training with 1 job only (and uncommenting ali_model.eval(), which I doubt matters) were:
vs. the checked-in results from @zhu-han, which were:
... so according to this, it does not really make a difference which model we use for alignment.
Would it make sense to use a pure TDNN/TDNNF/CNN model for alignments? I was investigating alignments from the conformer recently and my feeling was that they weren't perfect (even though the test-clean WER is ~4%) -- i.e., they sometimes seem a bit warped/shifted, but not in a consistent way. I think the self-attention layers allow the model to "cheat" to some extent with the alignments; I don't know if the same happens with RNNs, but I doubt it would happen with local-context models. Unfortunately, I don't have any means of providing a more objective evaluation than showing a screenshot (look closely at the boundaries with silences).
That's interesting, how did you obtain that plot? I am thinking it might be possible, though, if we had a model that was good for alignment, to save 'constraints'
I'll submit a PR with the code that allows computing alignments and visualizing them later. As to data augmentation of alignments, we could extend most transforms to handle it -- I'm pretty sure we can still do speed perturbation, noise mixing, and SpecAugment masks (but probably not the warping). We don't have reverb in Lhotse yet, but it's probably straightforward as well. Batching is possible too, but I think the alignments would need to be a part of Lhotse rather than external to it, so we can process them properly with everything else in the dataloader.
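As a rough illustration of the speed-perturbation case, here is a sketch (not Lhotse API; the function name is made up) of resampling a per-frame alignment to match audio perturbed by a given factor:

import numpy as np

def perturb_alignment(ali: np.ndarray, factor: float) -> np.ndarray:
    # Speeding up by `factor` (e.g. 1.1) shortens the utterance, so the
    # new alignment has round(len(ali) / factor) frames, each copied
    # from the nearest original frame.
    new_len = int(round(len(ali) / factor))
    src_idx = np.minimum((np.arange(new_len) * factor).round().astype(int),
                         len(ali) - 1)
    return ali[src_idx]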
cool!
Regarding this: it's actually weird that the CTC and MMI alimdl would not make a difference. Some time ago, I think I looked at both the CTC and MMI posteriors, and they are very different -- the CTC posteriors are spiky and the MMI posteriors are not (i.e., MMI tends to recognize repeated phone ids, whereas CTC tends to recognize one phone id followed by blanks). Given the way the alimdl's posteriors are added to the main model's posteriors, I'd think it would be important.
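For context, a minimal sketch of one plausible way such blending could look (an additive combination of log-posteriors with a scale; an illustration under assumptions, not necessarily snowfall's exact code):

import torch

def blend_posteriors(main_logprobs: torch.Tensor,
                     ali_logprobs: torch.Tensor,
                     scale: float = 0.5) -> torch.Tensor:
    # Both tensors are assumed to be (N, T, C) log-posteriors; lengths can
    # differ by a frame or two due to subsampling, so truncate to the
    # shorter one before adding the scaled alignment-model output.
    t = min(main_logprobs.shape[1], ali_logprobs.shape[1])
    return main_logprobs[:, :t] + scale * ali_logprobs[:, :t]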
Mm, I would have expected the MMI one to be better; but since we're just using this at the start of training to guide the model towards plausible alignments, it could be that the difference gets lost by the end.
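That intuition could be made explicit with a weight on the alignment model that decays over training; a sketch (the schedule and constant are assumptions, not the code in this PR):

def ali_model_weight(batch_idx: int, warmup_batches: int = 500) -> float:
    # Starts near 1.0 and decays toward 0, so the alignment model only
    # shapes early training; whatever differs between the CTC and MMI
    # alignment models fades out by the end, consistent with the nearly
    # identical final WERs reported above.
    return warmup_batches / (batch_idx + warmup_batches)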